A Synthetic Biology Approach for Vaccine Candidate Design against Delta Strain of SARS-CoV-2 Revealed Disruption of Favored Codon Pair as a Better Strategy over Using Rare Codons

The SARS-CoV-2 delta variant (B.1.617.2) appeared for the first time in December 2020 and later spread worldwide. Currently available vaccines are not so efficacious in curbing the viral pathogenesis of the delta strain of COVID; therefore, the development of a safe and effective vaccine is required. In the present study, we envisaged molecular patterns in the structural genes’ spike, nucleoprotein, membrane, and envelope of the SARS-CoV-2 delta variant. The study was based on determining compositional features, dinucleotide odds ratio, synonymous codon usage, positive and negative codon contexts, rare codons, and insight into relatedness between the human host isoacceptor tRNA and preferred codons from the structural genes. We found specific patterns, including a significant abundance of T nucleotide over all other three nucleotides. The underrepresentation of GpA, GpG, CpC, and CpG dinucleotides and the overrepresentation of TpT, ApA, CpT, and TpG were observed. A preference towards ACT- (Thr), AAT- (Asn), TTT- (Phe), and TTG- (Leu) initiated codons and aversion towards CGG (Arg), CCG (Pro), and CAC (His) was present in the structural genes of the delta strain. The interaction between the host tRNA pool and preferred codons of the envisaged structural genes revealed that the virus preferred the codons for those suboptimal numbers of isoacceptor tRNA were present. We see this as a strategy adapted by the virus to keep the translation rate low to facilitate the correct folding of viral proteins. The information generated in the study helps design the attenuated vaccine candidate against the SARS-CoV-2 delta variant using a synthetic biology approach. Three strategies were tested: changing TpT to TpA, introducing rare codons, and disrupting favored codons. It found that disrupting favored codons is a better approach to reducing virus fitness and attenuating SARS-CoV-2 delta strain using structural genes.


Introduction
Classically the live attenuated vaccine candidates are prepared through serially passaging it in non-optimal conditions, which leads to the attenuation of the virus [1]. Vaccine candidate generation through this method is a lengthy and stochastic process. For human poliovirus, the Sabin vaccine, and for the Rinderpest virus, the Plowright vaccine has been developed. Some key mutations are responsible for virus attenuation. However, Vaccines are reported to be efficacious against infectious diseases, as shown by clinical trials [29]. The data suggested that the vaccination gave protection against severe disease outcomes in the case of the B.1.1.7 (alpha) variant, isolated first in the United Kingdom [30]. For the B.1.351 (beta) variant, effectiveness against the severe disease was reported to be low; however, it still reduced severe and fatal outcomes in individuals vaccinated with the BNT162b2 vaccine [31]. The same BNT162b2 vaccine exhibited a high level of neutralization against the P.1 (gamma) variant [32].
Delta variant possesses mutations in the spike region of the virus (spike protein mutations T19R, ∆157-158, L452R, T478K, D614G, P681R, and D950N). These mutations enable the delta strain to escape the immune response. The data is limited on protection by vaccination with BNT162b2 and ChAdOx1 nCoV-19 against symptomatic delta strain. David W. Eyre (2022) [33] reported that two vaccinations with either BNT162b2 or ChAdOx1 nCoV-19 caused a small reduction in delta variant transmission than the alpha variant, as evidenced by a partial reduction in the PCR Ct values. The effectiveness of the BNT162b2 vaccine against symptomatic COVID-19 was 57% after the first vaccine dose in adolescents [34].
Furthermore, Luo CH and Morris CP (2021) [35] reported the delta variant as a cause of higher infectious virus loads in both vaccinated and unvaccinated individuals. Considering the partial effectiveness of available vaccines against the delta strain, it is essential to design strategies that effectively target the delta variant. In the present study, we attempted to gain insights into molecular patterns present in the structural genes (spike (S), nucleoprotein (N), membrane (M), and envelope (E)) of the COVID delta strain, which can be explored into the synthetic biology approach to develop a vaccine candidate.

Sequence Retrieval
We retrieved the sequences from the National Center for Biotechnological Information (NCBI) for the SARS-CoV-2 delta strain. The sequences taken were collected between January 2022 to July 2022. SARS-CoV-2 is the plus-sense, single-stranded viral RNA genome that encodes open-reading-frames (ORFs) for sixteen non-structural proteins that form the replication machinery (ORF1a/ORF1b), four structural proteins (spike (S), nucleoprotein (N), membrane (M) and envelope E)), and seven accessory proteins. Both the structural and non-structural proteins can be targeted for virus attenuation. Still, we chose to focus on structural proteins since these proteins are essential for the host cells' binding and invasion, and immune response against them will be able to provide an effective barrier against viral binding and invasion in the host cell.
A total of 190 sequences for each structural gene encoding for the spike, nucleoprotein, membrane, and envelope were obtained. All selected sequences did not contain any ambiguous sequences, started with ATG and ended with stop codons TAA, TAG, or TGA, and were present in a triplet (The accession numbers of the sequences are given in Supplementary Table S1). For the convenience of study for recoding of envisaged genes, we took the E, M, NP and S genes from the delta virus strain (assession numberOM982659.1). Sequences for the E, M, N, and S genes of representative sarbecoviruses were taken from the work of Llanes et al. (2020) [36] in the study that included Bat SARS-like CoV RaTG13 (MN996532), Bat SARS-like CoV HKU3 (DQ022305), Bat SARS-like CoV SL-CoVZC45 (MG772933), Bat SARS-like CoV SL-CoVZXC21 (MG772934), Bat SARS-like CoV WIV1 (KF367457), SARS-CoV (Human, NC_004718), SARS-CoV (Civet, AY686863), SARS-CoV-2 (Human, NC_045512), SARS-CoV-2 (Tiger, MT365033), and Pangolin CoV (MT040333). Representative sequences from VOCs Alpha (MZ622337), Beta (MZ344999), Gamma (MZ477758), and Omicron (OQ084152) were also included.

Odds Ratio Analysis
The odds ratio is expected to observe dinucleotide frequency in a given nucleotide sequence. Various factors affect the odds ratio, including nucleotide composition [37], evolutionary forces [38], forces required to maintain RNA secondary structures involved in splicing and gene expression [39], and forces to evade host defense mechanism (specific context is CpG where viral pathogens avoid CpG since CpG dinucleotide are perceived as pathogen-associated molecular patterns by host cells, and viral pathogens tend to decrease CpG content [40]. For the four genes, the dinucleotide odds ratio was calculated using DNASTAR Lasergene Inc. An odds ratio value of 0.78 and 1.23 is considered underrepresentation and overrepresentation of dinucleotides, respectively [41].

Relative Synonymous Codon Usage (RSCU) Analysis
RSCU help in knowing the preferred and non-preferred codons. The relative synonymous codon usage (RSCU) is calculated using the formula. RSCU = S × Nc/Na where S = the number of synonymous codons encoding the same amino acid. Nc = the frequency of the codon in the genome. Na is the relative frequency of the codon for that amino acid.

Codon Context Analysis
The effects of adjacent sequences on protein translation are called context effects. There are experimental pieces of evidence suggestive of the effects of context on nonsense suppression, missense suppression, translational errors, and frameshifting, which is further supported by statistical analysis that explain that the context around codons is not random [42]. Also, the codon context affects translational kinetics [43]. Codon context analysis was done using Anaconda software 2 ® [44]. The context was evaluated into the matrix of 64 × 64, where stop codons were included, and the direction was kept 5 to 3 .

High Occurring Codon Pairs
The efficient translation is dependent on the usage of codons and codon pairs. Some synonymous codons are used more in comparison to other codons, which is called codon bias. Similarly, some codon pairs are also frequently used and referred to as codon pair bias. An example is codon pair GCA-GAG, which is preferentially used to encode amino acid pair alanine-glutamic acid compared to GCC-GAA [45]. Both codon deoptimization and codon pair deoptimization are used to attenuate viruses [9,10,46]. High-occurring codon pairs were determined by Anaconda 2.0 version assessed on 24 July 2022. The generated report was trimmed, and the top 20 high-occurring codon pairs were taken.

Rare Codon Analysis
The usage of rare codons in the reading frame is used to control the translation rates and adopt an intermediate confirmation to attain proper protein folding [47]. In a few instances, substituting rare codons with the optimal one resulted in protein misfolding and affecting solubility [48] and, eventually, loss of biological activity [49]. The presence of rare codons might be tissue-specific [50] and indicative of translational programming of cell proliferation [51]. The number of rare codons was calculated for all 4 genes and normalized to get percent occurrence. An occurrence of less than 0.5% was set as a criterion to be rare codons, and above than 5% was considered abundant codons.

Codon Pair Score
Codon pair bias can be quantified using the codon pair score (CPS) statistics [45]. The codon pair bias (CPB) indicates the bias present in the codon pair, and it is the mean of the codon pair scores (CPSs) for all of its codon pairs present in an ORF or gene. In turn, the CPS for each codon pair is the natural log of the ratio of the observed versus expected frequency of that codon pair [52]. Statistically, underrepresented codon pairs have negative, and overrepresented codon pairs have positive CPS values. The average CPS of a gene is calculated as the arithmetic mean of individual CPS values. The lower the value of CPS, the more the virus will be attenuated.  [53]. Genome-wide RNA decay analysis revealed that stable mRNA are generally rich in optimal codons and result in high gene expression, while unstable mRNA encompass predominately nonoptimal codons. In unstable RNAs, more than 60% of the codons are non-optimal [54]. The higher the stability of an mRNA in the cytoplasm, the higher quantities of proteins will be produced. Several algorithms like mfold, KineFold, and ViennaRNA are used to predict plausible mRNA structures. The software computes the mRNA thermodynamic stability value in the form of minimum free energy (MFE), a thermodynamic energy measurement based on intramolecular stacking, the system's temperature, entropy, enthalpy, and ionic conditions, and hydrogen bond interactions [55]. A lower MFE depicts more stable mRNA [56], and unstable mRNA structures have more than 60% non-optimal codons [54]. Less stable transcripts will have fewer negative values, and with more negative values, better-fit progeny will be generated. The RNAfold server was used to calculate the transcript's minimum free energy (MFE).

Codon Adaptation Index (CAI) Calculation
CAI is a common evaluation measure of protein expression [57]. CAI alone is not very comprehensive but imperative to determine gene expression [58]. CAI values were calculated using the CAICal served developed by Puigbò and colleagues [57].

Compositional Features of SARS-CoV-2 Delta Strain Structural Genes Revealed at Richness
The nucleotide composition of any genome is responsible for mutational robustness, which indicates the capacity to withstand mutations exhibiting no or slight variation in phenotype upon introducing mutations [59] and influence codon usage [60]. Disproportionate base composition accounts for much of codon usage in RNA viruses [61]. In the present study, the structural genes of the delta strain of the SARS-CoV-2 virus were studied for their nucleotide composition ( Table 1). The average nucleotide composition of the gene indicated that the genes were AT-rich, with an abundance of T nucleotide (except for the M gene); however, at the third codon position, all the genes have richness in T nucleotide.

Odds Ratio Analysis Indicated Both under and Overrepresentation of Some Mirror Dinucleotides
The odds ratio analysis revealed that TpT and CpA showed maximum variation in values with standard deviations of 0.63 and 0.51, respectively. The average TpT dinucleotide value ranged between 0.867-2.383, while CpA ranged between 0.359-1.569 for the four structural genes of the delta virus. Since the variation was in higher ranges, the deviation was high.
The least deviation was observed for TpG, ApC, and ApG dinucleotides. TpT dinucleotide was underrepresented in the NP gene, while in other genes, it is overrepresented. CpG, as expected, was underrepresented in all the genes. Based on the average odds ratio, it was evident that GpA, GpG, CpC, and CpG dinucleotides were underrepresented (odds ratio < 0.78), while TpT, ApA, CpT, and TpG were overrepresented (odds ratio > 1.23). Here CpC and GpG and; TpT and ApA are the mirror dinucleotides that are underrepresented and overrepresented, respectively. From the figure, it is evident that the odds ratio of E, M and NP genes are somewhat similar, while for the NP gene, there is little difference (Figure 1). values with standard deviations of 0.63 and 0.51, respectively. The average TpT dinucleotide value ranged between 0.867-2.383, while CpA ranged between 0.359-1.569 for the four structural genes of the delta virus. Since the variation was in higher ranges, the deviation was high.
The least deviation was observed for TpG, ApC, and ApG dinucleotides. TpT dinucleotide was underrepresented in the NP gene, while in other genes, it is overrepresented. CpG, as expected, was underrepresented in all the genes. Based on the average odds ratio, it was evident that GpA, GpG, CpC, and CpG dinucleotides were underrepresented (odds ratio < 0.78), while TpT, ApA, CpT, and TpG were overrepresented (odds ratio > 1.23). Here CpC and GpG and; TpT and ApA are the mirror dinucleotides that are underrepresented and overrepresented, respectively. From the figure, it is evident that the odds ratio of E, M and NP genes are somewhat similar, while for the NP gene, there is little difference (Figure 1).

Dinucleotide Bias at the Junction of Codons
Dinucleotide bias at the junction of codons (C at 3rd position of 5′ codon and G at 1st codon position in subsequent 3′ codon and similarly T at 3rd position of 5′ codon and A at 1st codon position in subsequent 3′ codon, is called p3-1 junction), were evaluated for CpG and TpA and demonstrated in Figure 2. Positive and negative contexts were found for CpG and TpA at the p3-1 junction, though the negative context was more prominent than the positive one ( Figure 2). The negative context for CpG at the junction was mainly present in the S and N genes. In the E gene, only positive, while in the M gene, both positive and negative contexts were present depending on the amino acid. For TpA dinucleotide at the junction, in gene S, a highly negative context was present, followed by M and N genes. Similar to the CpG junction, the TpA junction was present in a positive context

Dinucleotide Bias at the Junction of Codons
Dinucleotide bias at the junction of codons (C at 3rd position of 5 codon and G at 1st codon position in subsequent 3 codon and similarly T at 3rd position of 5 codon and A at 1st codon position in subsequent 3 codon, is called p3-1 junction), were evaluated for CpG and TpA and demonstrated in Figure 2. Positive and negative contexts were found for CpG and TpA at the p3-1 junction, though the negative context was more prominent than the positive one ( Figure 2). The negative context for CpG at the junction was mainly present in the S and N genes. In the E gene, only positive, while in the M gene, both positive and negative contexts were present depending on the amino acid. For TpA dinucleotide at the junction, in gene S, a highly negative context was present, followed by M and N genes. Similar to the CpG junction, the TpA junction was present in a positive context only in the E gene. Overall analysis revealed that at the junction, all kinds of contexts (no context, positive context, negative context) were present; however, in the S gene, negative contexts were more prominent for the TpA junction, which could be the result of selection forces. Our results concord with the results obtained by Beutler et al., 1989 [62].
A similar result was obtained on Ustilago, a fungal parasite of grasses, where extensive codon context analysis revealed avoidance of TpA at codon-codon junctions, and possibly it is attributed to reducing the risk of nonsense mutations resulting in a stop codon and abrupt chain termination [63] and affecting subsequent translation fitness as a part of the selection [64]. In contrast to our result, TpA was the second most abundant dinucleotide at the junction in Human Rhinoviruses A, B, and C [65]. CpG frequency is dropped in the Influenza A virus at the p3-1 junction [66], and it is additional to the intracodon CpG component, where all CpG-containing codons were underrepresented [67].
TpA and CpG underrepresentation at the p3-1 junction suggested that codon choices alone may not explain the scarcity of TpA and CpG since, at this position, at least TpA has no defined coding function in this frame and is the result of multiple forces, including immune pressure [67,68], high mutability resulting in a transition from CpG to TpG [69], selection forces [70], TpA having mRNA destabilizing effect [62], higher susceptibility of UpA to cytoplasmic RNase [71] and evading interferon-inducible protein ZAP and RNAseL as host protein responsible for sensing CpG in viral RNA.

RSCU Values of Codons from Four Structural Genes Revealed That for All Genes; Preferred Codons Are Not the Same
Relative synonymous codon usage (RSCU) is one of the imperative parameters for evaluating the codon bias present in synonymous codons. It represents the expected occurrence frequency of any codon out of all synonymous codons for a particular amino acid, multiplied by the degeneracy level and suggestive of codon priority among synonymous codons encoding for a single amino acid [72]. A higher RSCU values suggest a preferred codon, while lower values are indicative usage of non-preferred codons.
The RSCU values below 0.6 suggest low occurrence, while values above 1.6 suggest vice versa. CpG and TpA suppression is expected owing to facts mentioned in this article in the above section and observed in vertebrate viruses also [73]. CpG and TpA dinucleotide suppression is reflected in the CpG and TpA encompassing codons, and the same is evidenced by RSCU analysis of codons in various virus models. All eight CpG- TpA and CpG underrepresentation at the p3-1 junction suggested that codon choices alone may not explain the scarcity of TpA and CpG since, at this position, at least TpA has no defined coding function in this frame and is the result of multiple forces, including immune pressure [67,68], high mutability resulting in a transition from CpG to TpG [69], selection forces [70], TpA having mRNA destabilizing effect [62], higher susceptibility of UpA to cytoplasmic RNase [71] and evading interferon-inducible protein ZAP and RNAseL as host protein responsible for sensing CpG in viral RNA.

RSCU Values of Codons from Four Structural Genes Revealed That for All Genes; Preferred Codons Are Not the Same
Relative synonymous codon usage (RSCU) is one of the imperative parameters for evaluating the codon bias present in synonymous codons. It represents the expected occurrence frequency of any codon out of all synonymous codons for a particular amino acid, multiplied by the degeneracy level and suggestive of codon priority among synonymous codons encoding for a single amino acid [72]. A higher RSCU values suggest a preferred codon, while lower values are indicative usage of non-preferred codons.
The RSCU values below 0.6 suggest low occurrence, while values above 1.6 suggest vice versa. CpG and TpA suppression is expected owing to facts mentioned in this article in the above section and observed in vertebrate viruses also [73]. CpG and TpA dinucleotide suppression is reflected in the CpG and TpA encompassing codons, and the same is evidenced by RSCU analysis of codons in various virus models. All eight CpG-encompassing codons are found to be underrepresented in the Nipah virus [74]. RSCU values analysis of six codons containing TpA (TTA, CTA, ATA, GTA, TAT, and TAC) indicated that these are not preferred in Mycoviral genes [75]. HCV also showed a significant tendency to not prefer the codons with CpG or TpA dinucleotides [76], and many researchers have reported similar results [77][78][79][80] establishing a correlation between the presence of CpG and TpA and lower RSCU.
The RSCU value is independent of the amino acid composition of any gene and hence helps compare different genes [81]. Values near 1 suggest the unbiased use of codons [82]. In a synthetic recoded virus vaccine candidate construct, the higher RSCU values codons must be replaced with the lower RSCU valued codon. The RSCU values for each of the genes envisaged and given in Table 2 below. The analysis indicated that each gene codon usage pattern is different. The most preferred codon among synonymous codons is given in bold.

Codon Usage Comparison for Other Variants of Concern (VOCs) of SARS-CoV-2 and Representative Sarbecoviruses
The average RSCU value of SARS-CoV-2 VoCs and representative Sarbecoviruses is given in Table 3 and compared with delta strain. The analysis revealed that though the RSCU values for each codon slightly differed for different strains, the preferred codon choice remained the same for all amino acids in all envisaged viruses excluding phenyl alanine. Table 3. Average RSCU values for representative SARS-CoV-2 VoCs and Sarbecoviruses. The most preferred codon for an amino acid from the synonymous codon family is given in bold.

Codons
Amino

ACT-, AAT-, TTT-and TTG-Initiated Codons Were Preferred in at Least Three out of Four Genes
Codon pair deoptimization (CPD) is an efficient virus attenuation technique where suboptimal pair of codons is used, and synonymous codons are changed so that amino acid composition remains the same and hence the antigenicity [7]. The ultimate goal of codon swapping is to increase the number of underrepresented codon pairs in the virus's genes. The strategy has been implicated in attenuating viruses for making vaccine candidates, including human respiratory syncytial virus [83], porcine reproductive and respiratory syndrome viruses [84], enterovirus A71 [85], and dengue virus 2 [86], and the list is long. In the present study, we presented both the high occurring codon pairs and lowoccurring codons so that the high-occurring pairs may be disrupted with the low-occurring codons. Table 4 presents the top 20 most preferred codon pairs in the envisaged genes. Analysis revealed that among the top 20 most preferred codon pairs, ACT-(Threonine) AAT-(Asparagine), TTT-(Phenyl alanine) and TTG-(Leucine) initiated codons were preferred in at least three genes out of four envisaged. On the other hand, GTT-, GGA-and CTTinitiated (Val, Gly and Leu) codons were preferred in at least two genes.

Preferred Codon Pair Analysis in Sarbecoviruses and Other SARS-CoV-2 VoCs
Delta virus structural genes were compared with the other strains of SRAS-CoV-2 and sarbecoviruses, and the top 20 codon pairs for each of the genes are given in Table 5A-D. When the E gene of all envisaged strains was compared, the analysis revealed that in delta and other strains, Phenylalanine-, Leucine-, Serine-, and Tyrosine-initiated codons are preferred. The difference was in Valine-initiated codons, which were abundant in delta (05 valine-initiated codon pairs), while in other strains of SARS-CoV-2 Serine initiated (06), codon pairs were preferred (Table 5A). For the M gene, all the strains envisaged preferred Phenyl alanine-initiated and Leucine initiated codons. However, the number of Phenyl alanine-initiated codon pairs was less (04) in the delta strain compared to 06 Phenyl alanineinitiated codon pairs in others, including Sarbecoviruses. Also, in delta Leucine-initiated, 05 codon pairs were present, while their number was 07 in other viruses (Table 5B). For the N gene, Glycine-initiated codon pairs (05) were preferred in SARS-CoV-2 VoCs, excluding Sarbecoviruses and delta. In the delta, only 03 codon pairs were Glycine-initiated, while in Sarbecoviruses, Lysine-initiated codon pairs were preferred (04) ( Table 5C). For the S gene, the delta strain and other SARS-CoV-2 strains Glycine-initiated (≥03) codon pairs were preferred. In Sarbecoviruses, no such clear pattern was observed. Furthermore, in Omicron, Gycine-initiated (04) codon pairs were preferred (Table 5D). Based on the analysis, it can be said that since the codon preference is the same for all VoCs, including delta and Sarbacoviruses, the choice of the preferred codons is also similar to some extent. However, for codon pairs, the choice differed to some extent when the delta was compared with others. The difference may result from complex molecular interactions or signature molecular patterns. Since we included only the top 20 codon pairs in the study, other shared codon pairs between the viruses are possible.    IA  IA  IA  IA  IA  IA  IA  IA  IA  IA  CL  CL  CL  CL  CL  CL  CL  CL  CL  CL  GA  GA  GA  GA  GA  GA  GA  GA  GA  GA  AC  LV  AC  LV  AC  LV  LV  LV  LV  LV  LV  LL  LV  LL  LV  LL  LL  LL  LL  LL  LL  LR  LL  LR  LL  LR  LR  LR  LR  LR  LR  MW  LR  MW  LR  MW  MW  MW  MW  MW  MW  TI  MW  TI  MW  TI  TI  TI  TI  TI TS  TS  TS  TS  GA  TS  TS  TS  SN  TS  TN  TN  TN  TN  GD  TN  TN  TN  SF  TN  SN  SN  SN  SN  GF  SN  SN

Codon Context Revealed Highest Codon Pair Bias in Spike Protein
A substantial bias is present during codon pair utilization, called dicodon bias or codon context. It is a well-recognized phenomenon and is considered to arise from GC-biased gene conversion [87]. It is a direct cause of dinucleotide bias [18]. We performed codon context analysis for four genes of SARS-CoV-2, and all kinds of contexts ((negative (residual values less than −5), positive (residual values more than +5), insignificant (residual values between −5 and 5), and no context (residual zero)) were found in these genes. The insignificant context was absent in the envelope gene, while in the spike gene, the maximum positive and negative codon pair biases were present ( Figure 3A-D). codon context [90], and nonsense and missense suppression, elongation rate, the precision of tRNA selection and polypeptide chain termination all appeared to be affected by codon context [91]. Since the size of the S gene is the largest among all the envisaged genes, we speculate that the above-stated factors will be more operative on larger genes due to the very nature of the longer gene. Furthermore, at least in our envisaged genes, we found the same pattern, and codon context bias increased with the size of the gene. Therefore, the comparatively larger size of the S gene is attributed to maximum codon context bias, and a more positive context may be a molecular signature of the S gene.  The two-color scale shows the codon context bias. Strongly preferred codon context bias is depicted as pink, while strongly rejected codon context is depicted as green. In the case where the 3′ context is not strongly biased, it is depicted as black. The Grey color shows the presence of no context. The color scale is given in the figures.

Codons CGG (Arg), CCG (Pro) and CAC (His) Were Rare in all the Genes
Rare codons are not randomly present inside the mRNA sequence, indicating operative selective forces. Rare codons help initiate proper protein folding in nascent peptides and prevent the formation of secondary structures in mRNA in the 5′ region [92]. Like optimal codons, rare codons are also maintained through evolutionary forces. The incorporation of rare codons has been shown to reduce the translation of poliovirus capsid protein resulting in virus attenuation [47]. Using Anaconda2 software, we calculated the number of rare codons and then normalized them with gene length. An occurrence rate below 0.5% was considered a rare gene codon, which is a default value given by Ana-conda2 software.
CGG is the rarest codon in the SARS-CoV-2 genome, and inserting two tandem CGG codons in the spike protein might result in ribosome pausing at rare codons. Ribosomal pausing has a role in the efficient regulation of protein expression and co-translational subdomain folding [93]. Codons CGG (Arg), CCG (Pro) and CAC (His) were rare in all the genes. GGG (Gly), CCC (Pro), and TCG (Ser) codons were rare in at least three genes ( Figure 4). Codon CAA is highly used in NP (>5% of total codons) while used less than 1% in E and S genes. The two-color scale shows the codon context bias. Strongly preferred codon context bias is depicted as pink, while strongly rejected codon context is depicted as green. In the case where the 3 context is not strongly biased, it is depicted as black. The Grey color shows the presence of no context. The color scale is given in the figures.
The E, M, N, and S genes encode structural proteins [88]. The S gene plays a crucial role in receptor recognition and cell membrane fusion [89]. The sizes of the E, M, N, and S genes are 228bp, 669bp, 1260bp, and 3822 bp, respectively. Translational selection shapes codon context [90], and nonsense and missense suppression, elongation rate, the precision of tRNA selection and polypeptide chain termination all appeared to be affected by codon context [91]. Since the size of the S gene is the largest among all the envisaged genes, we speculate that the above-stated factors will be more operative on larger genes due to the very nature of the longer gene. Furthermore, at least in our envisaged genes, we found the same pattern, and codon context bias increased with the size of the gene. Therefore, the comparatively larger size of the S gene is attributed to maximum codon context bias, and a more positive context may be a molecular signature of the S gene.

Codons CGG (Arg), CCG (Pro) and CAC (His) Were Rare in All the Genes
Rare codons are not randomly present inside the mRNA sequence, indicating operative selective forces. Rare codons help initiate proper protein folding in nascent peptides and prevent the formation of secondary structures in mRNA in the 5 region [92]. Like optimal codons, rare codons are also maintained through evolutionary forces. The incorporation of rare codons has been shown to reduce the translation of poliovirus capsid protein resulting in virus attenuation [47]. Using Anaconda2 software, we calculated the number of rare codons and then normalized them with gene length. An occurrence rate below 0.5% was considered a rare gene codon, which is a default value given by Anaconda2 software.
CGG is the rarest codon in the SARS-CoV-2 genome, and inserting two tandem CGG codons in the spike protein might result in ribosome pausing at rare codons. Ribosomal pausing has a role in the efficient regulation of protein expression and co-translational subdomain folding [93]. Codons CGG (Arg), CCG (Pro) and CAC (His) were rare in all the genes. GGG (Gly), CCC (Pro), and TCG (Ser) codons were rare in at least three genes ( Figure 4). Codon CAA is highly used in NP (>5% of total codons) while used less than 1% in E and S genes.  For comparison among different strains of SARS-CoV-2 with delta strain, rare codon analysis was carried out considering all four structural genes as one sequence for each of the viral strains, and the number of rare codons was normalized. ACG, CAC, CCG, CGA, CGG, CGC, GCG, GGG, and TCG were rare in the delta and all other envisaged strains and had a frequency below 0.5%. Only the CGC codon frequency was slightly higher than 0.5% (0.62%) in Sarbacoviruses. We then performed pairwise comparisons between the codon frequencies and found no statistically significant difference. The analysis indicated that for all the VoCs, including delta and Sarbacoviruses, nine codons are rare.

Codon Preference of SARS-CoV-2 Gene Delta Sequences Is towards Rare Human Isoacceptor tRNAs
It is suggested that suboptimal usage of isoacceptor host tRNAs helps slow and gradual translation of viral proteins to ensure correct folding [94]. Identification of the most preferred codons (for each amino acid) in the envisaged structural genes of the SARS-CoV-2 delta strain and the most abundant isoacceptor tRNAs in human cells revealed that only for ILeu codon, the preferred codon is matched with the respective most abundant isoacceptor tRNAs in human hosts (Table 6). Other than Ileu, out of 18 amino acids, only four amino acids (Phe, Leu, Ala, and Tyr) preferred codons were matched with abundant isoacceptor tRNA (in three genes out of four). The results suggested that the codons preferred by the envisaged genes of the delta strain of SARS-CoV-2 do not match the abun- For comparison among different strains of SARS-CoV-2 with delta strain, rare codon analysis was carried out considering all four structural genes as one sequence for each of the viral strains, and the number of rare codons was normalized. ACG, CAC, CCG, CGA, CGG, CGC, GCG, GGG, and TCG were rare in the delta and all other envisaged strains and had a frequency below 0.5%. Only the CGC codon frequency was slightly higher than 0.5% (0.62%) in Sarbacoviruses. We then performed pairwise comparisons between the codon frequencies and found no statistically significant difference. The analysis indicated that for all the VoCs, including delta and Sarbacoviruses, nine codons are rare.

Codon Preference of SARS-CoV-2 Gene Delta Sequences Is towards Rare Human Isoacceptor tRNAs
It is suggested that suboptimal usage of isoacceptor host tRNAs helps slow and gradual translation of viral proteins to ensure correct folding [94]. Identification of the most preferred codons (for each amino acid) in the envisaged structural genes of the SARS-CoV-2 delta strain and the most abundant isoacceptor tRNAs in human cells revealed that only for ILeu codon, the preferred codon is matched with the respective most abundant isoacceptor tRNAs in human hosts (Table 6). Other than Ileu, out of 18 amino acids, only four amino acids (Phe, Leu, Ala, and Tyr) preferred codons were matched with abundant isoacceptor tRNA (in three genes out of four). The results suggested that the codons preferred by the envisaged genes of the delta strain of SARS-CoV-2 do not match the abundant tRNA pool in the human body (Table 6). Table 6. Table for

Vaccine Candidate Designing Using Information Generated in the Study
Viral fitness may be reduced by introducing rare codons for the virus [14], introducing the codons that are one substitution away from stop codons [95], and deoptimizing codon pairs [96]. Based on the analysis of envisaged genes, authors constructed three vaccine candidates (only the envisaged structural gene included), and those constructs were analyzed systemically for viral fitness. The first construct was based on the information that our sequences are TT and AA dinucleotide rich, and attenuation is correlated to an increase in TA content and a decrease in TpT and ApA dinucleotide [53]. Therefore we recoded three overrepresented TT-containing codons (CTT, GTT, and CTT codons) and replaced them with low-occurrence TA-containing codons (Table 7). While designing the second construct, we replaced abundant codons with rare codons common to the envisaged genes ((Codons CGG (Arg), CCG (Pro), CAC (His) GGG (Gly), CCC (Pro), and TCG (Ser) were introduced)). Finally, in the third construct, we disrupted preferred codon pairs (Table 7; ACT-, AAT-, TTT-, TTG-GTT-, GGA, and CTT-initiated; only 5' codon was deoptimized from the favored codon pair).
Van Leuven and colleagues [13] verified in phage 174 that the folding stability of the deoptimized codon mRNA is the best predictor of virus fitness, followed by CAI. Furthermore, in the experimentation of Groenke and colleagues, it was proved that with the lowest codon pair score, the highest virus attenuation is obtained [21]. Virus fitness by codon deoptimization is correlated to the amount of recoding performed, and codon deoptimization taking only one feature CAI (ignoring mRNA stability and codon pair score) doesn't result in sufficient attenuation [13]. Thus, systemically, we used all three parameters to assess our construct's fitness. mRNA stability was highest (folding energy −1801.30 kcal/mol) for the construct where rare codons were introduced, and it was even more negative than the native construct. Higher negative values exhibited higher virus fitness (though CAI was the least and CPS was the lowest, exhibiting attenuation). MFE was low for the construct recoded with TA ending codons (−1684.4 kcal/mol). The effect is likely owing to the mRNA destabilizing effect of TA [97]; however, in this construct, the CAI value was not much less, and the CPS score was also similar to that of the wild type. Disruption of favored codon pairs resulted in reduced protein expression (low CAI), low CPS and low mRNA stability (all three parameters we tested). Also, to construct three out of seven codons, only two codons have abundant corresponding isoacceptor tRNA (Table 6), and all remaining five isoaccptortRNA were suboptimal. Therefore from our analysis, construct three recoded where the favored codon pair is disrupted by introducing rare codons emerged as the most suitable candidate. In future studies, one may further incorporate changes to have better deoptimization. Here it is noteworthy that virus fitness is a complex term and results from many epistatic and genetic factors, which we ignored here due to the study limitations.

CpG Suppression in Different Constructs
Zinc finger antiviral protein (ZAP) powerfully restricts the viruses with elevated CpG and TpA dinucleotide frequencies [98,99], and the same is proved by knock-out experiments where attenuation in CpG-and UpA-high viruses was reversed in ZAP knock-out cell lines. CpG suppression in RNA and reverse transcribing viruses previously reported to be ZAP sensitive with odds ratio Sindbis virus (0. In contrast, the odds ratio was less for ZAP-insensitive HIV-1 (0.21), the Yellow fever virus (0.38), and the Vesicular stomatitis virus (0.48) [98], suggesting that for higher odds ratios, the virus becomes ZAP-sensitive. For our constructs, the odds ratios were 0.268, 0.268, 0.635, and 0.63 for the wild-type delta construct and constructs 1, 2, and 3, respectively, with the highest odds ratio of 0.635 for construct 2 reported. It indicates the ZAP sensitivity of recoded constructed 2 and 3. The CpG suppression was highest in construct 2.
An experiment of CpG enrichment from 02 CpGs to 39 CpGs in mutant L and 02 to 43 CpGs in LCG-HI in HIV-1 demonstrated~100-fold lower replication than WT in primary lymphocytes [100]. In MEF-1 poliovirus, a type 2 wild poliovirus prototype strain with neurovirulence in humans, with the increasing substitutions, virus fitness was decreased but reduced most efficiently by increasing the frequencies of CpG and UpA dinucleotides [101]. The changes were brought in capsid region and CpG high constructs, namely ABc7 (80 CpG and 34 TpA) and ABc8 (90 CpG and 26 TpA), which exhibited a reduction in relative plaque area and relative plaque yields compared to reference construct having 28 CpG and 36 TpA. In the Influenza virus, smaller plaque sizes in CpG-high and TpA-high mutants were observed than in WT or permuted virus that brought no changes in overall A/T composition [102]. In the present study, in constructs 2 and 3, the CpG content was increased from 98 (for native delta construct) to 237 and 227, respectively, so we may expect the reduction of expression in our recoded constructs also. In the E7 genome, two segments were taken for the study, contributing to 16.7% and 14.2% of the full-length genome. CpG or TpA dinucleotides were altered from both regions. It was possible to reduce the CpG and TpA frequencies to approximately one-third or to enhance to 2.5-3fold the wild-type levels in a gene sequence. The infectivity of permuted control sequence was similar to that of the wild type. CpG high in both segments resulted in viral output approximately 7000-fold lower, while TpA high in both segments had approximately 30-fold lower viral output after 24 h. This means the attenuation was higher for CpG enhancement than for TpA enhancement [40]. CpG and TpA alteration with their impact on virus replication has been given in Table 8. The same concord with our results, where we found introducing rare codon and codon pair disruption more effective than enhancing TpA content.
Regarding the role of spacing between CpGs, it is demonstrated that when CpG is present in pairs, the DC stimulation is enhanced, and CD8 T cells are highly activated [103]. In the present study, in the native construct, no CpG dimer was present, while 04, 09, and 06 CpG dimers were present in constructs 1, 2, and 3. More CpGs in the sequence lead to increased IL10 and IL12 secretion [103]; thus highest IL10 and IL12 secretion will be there with construct 2.  Compositional analysis of each construct revealed that in native and construct 1, where TpT was converted to TpA ending codons, as expected, T/A composition was the same (60.2%). While for constructs 2 and 3, overall AT composition was less than the native construct, 56.11% and 54.59%, respectively. We have examples of viruses like IAV [102] or Zika [104] where, despite no changes in overall genomic AT/GC composition, enhancement in CpG and TpA, attenuated virus. On the other hand, in polio [101] and the E7 virus [40], CpG and TpA alterations with genome composition changes attenuated viruses, suggestive of a role of CpG and TpA in virus attenuation rather than composition.

Discussion
The SARS-CoV-2 delta strain has caused millions of death and has already proven to be one of the significant variants of concern. To counter the virus, scientists worldwide are continuously working to develop efficacious vaccine candidates. Since vaccine candidate development is a time-consuming process, if preliminary studies are done in-silico, it will save much time and resources. In the present study, we envisaged different molecular features of four structural genes, E, M, NP and S, from the perceptive of vaccine candidate development. The analysis is helpful in incorporating essential features during designing through the synthetic biology approach, and later, the viable attenuated virus can be rescued by using the reverse genetics approach.
Pasteurella multocida is an avian cholera pathogen, and to construct an attenuated vaccine candidate, it was cultured from 37 • to 45 • . Among many descendants, one developed with low virulence and high immunogenicity [106]. In the experiment of Xia and colleagues [107], genomic features, including the GC content and dinucleotide frequencies, were envisaged to identify possible reasons behind thermal attenuation. In the attenuated low pathogenicity strain, the GC content was low despite the fact that more GC content would enhance thermal stability during raised temperature, and GC-rich codons encoded amino acids alanine and arginine would impart in thermal stability of the proteins. Investigation of other attenuated viruses revealed that without altering overall genomic AT/GC composition, only enhancement in CpG and TpA content attenuated viruses like Zika [104] and IAV [102]. Contrarily, in the polio [101] and E7 viruses [40], CpG and TpA alterations with genome composition changes attenuated the viruses, suggesting that the CpG and TpA have a more significant role in attenuation than composition.
Selection for disfavoured codon pairs leads to unintended increases in CpG and UpA dinucleotide frequencies that also attenuate replication. In the viral genomes, CpG and UpA dinucleotides are present at low frequencies. Tulloch and colleagues manipulated a human gut virus, namely echovirus 7, where they made two sets of mutations. In one set, the codon pair frequencies were altered, keeping CpG and TpA constant. In contrast, in the second set, codon pair frequencies were kept the same while the CpG and TpA content was altered. The results revealed that alteration in codon pair frequency doesn't alter the viral fitness, but an increase in CpG or TpA weakens the virus, and it is possibly attributed to the viruses being targeted readily by the host immune system post increase in CpG content and not due to altered virus fitness [19]. Considering all the facts together, we suggest that while constructing SARS-CoV-2 vaccine candidates through synthetic biology approach, CG or TA content should be optimized in a way that CG content should neither be that low for the virus that helps in escaping the host defense system nor should be too high that before eliciting sufficient immune response it is eliminated by the immune system. Furthermore, the ApA, TpA, and TpT dinucleotides were higher, and those of ApT, GpC, and CpG dinucleotides were lower in the vaccine strain of the P. multocida strain than in the virulent strain. In the structural genes of the delta strain of SARS-CoV-2, TpT, ApA, CpT and TpG were overrepresented, while GpG, CpC, GpA, and CpG dinucleotides were underrepresented.
While constructing the vaccine candidate with the synthetic biology approach, knowing how much nucleotide content needs to be changed to get attenuation is essential. In echovirus 7, ten-fold or greater attenuation in cell culture was achieved by replacing >12-15% of the genome with codon pair deoptimized sequences that typically increased the frequencies of CpG and UpA from 0.4-0.6 to 1.4-1.6 (CpG) and from 0.5-0.8 to 1.1-1.4 (UpA) respectively [19].
Since TpA is commonly underrepresented in organisms [108,109], a further decrease in TpA content is helpful in attenuating viruses for developing vaccine candidates against animal viruses also. One example is the classical swine fever virus (CSFV), where the codon deoptimization technique was used in the glycoprotein E2-coding region of CSFV, where deoptimization increased TpA. Inoculation with this virus showed the animal's survival and remained clinically normal [110], indicating efficacy as a vaccine candidate for animal use. On the other hand, the Minute virus of mice, a Parvovirus, exhibited no attenuation followed by increasing TpA and showed a similar replication pattern as of wild-type virus [111]. Thus the effect of the elevation of TpA has a virus-specific impact and needs to be tested for viruses separately.
Hence it is suggested that for constructing the SARS-CoV-2 vaccine candidate, the overall permissible change in a genome is 10-15%. It is noteworthy that not all the ORFs experience the same degree of CpG suppression. CpG suppression is least in the E gene and ORF10, and both use underrepresented codon pairs, and CpG usage is high compared to other ORFs [112]. Hence while deciding the CpG content, it is also essential to keep in mind the original CpG usage of individual ORFs. This observation will be relevant to future strategies for a rationally attenuated SARS-CoV-2 vaccine.
For codon deoptimization, one may have the choice of using non-optimal codons or codon pairs from the host or virus itself (mentioned in the above section). In the present study, we envisaged various parameters like preferred codons, preferred codon pairs, and rare codons that may be used to recode the virus genetic sequence and design a codondeoptimized vaccine candidate. We analyzed both the preferred and rejected codon pairs for gene recoding. Codons CGG (Arg), CCG (Pro) and CAC (His) were rare in all the genes envisaged. GGG (Gly), CCC (Pro), and TCG (Ser) codons were rare in at least three genes. Codons ACG, CAC, CCG, CGA, CGG, CGC, GCG, GGG, and TCG were rare in the delta strain, and all other envisaged strains and had a frequency below 0.5% except for CGC having slightly higher (0.62%) in Sarbacoviruses. Judicial usage of these rare codons, along with their intelligent placements (like placements near the 5 region) in the recoded virus, is expected to attain an attenuated phenotype with the ability to evoke an immune response.
Generally, it is considered that for obtaining an attenuated vaccine candidate, it is essential to incorporate deoptimized (rare) codons [113] or codon pairs [7]. But there are examples where attenuated vaccine candidates have been designed by using excess optimized codon pairs. For example, in the attempt to construct an attenuated vaccine candidate against Vesicular Stomatitis Virus (VSV) by computer-aided design, two recombinant versions were prepared. One version contained the excess underrepresented codon pairs (L1Min), and the other one contained excess overrepresented codon pairs (L1sdmax), where all the manipulations were carried out into the polymerase gene 'L.' Multistep growth kinetics and plaque phenotypes of the wild type and the engineered one revealed that the L1sdmax version was both immunogenic and attenuated. This attenuation was not host range specific since it generated small plaques in all the cell lines tested [114]. This observation is likely attributed to overrepresented codon pairs altering the translation rate, leading to disrupted coordination between translation and protein folding. Here it is important to note that CpG is more effective in attenuation than TpA. The evidence is from the Influenza A virus, where both CpG and TpA high viruses were attenuated with 10-100 fold reductions in the viral loads in the lungs of infected mice. However, the pathogenicity of CpG-high viruses was substantially reduced [102]. The E7 virus was modified using 02 segments representing 16.7% and 14.2% of the full-length genome. When both segments were replaced with CpG high or TpA high segments, 100-to 10,000-fold impairments in replication were observed. However, out of two segments, if only one segment was CpG or TpA high and the second segment was WT, then CpG high/WT combination exhibited 144-fold less replication. Contrarily, TpA high/WT exhibited 10 folds greater amplification [40]. All the points indicate the more significant role of CpG in attenuation than TpA.
ACT-, AAT-, TTT-and TTG-initiated codons (Threonine, Asparagine, Phenylalanine and leucine) were preferred in at least 3 genes out of the four envisaged, whereas GTT-, GGA and CTT-initiated (Val, Gly and Leu) codons were preferred in at least two genes. In the E gene, Phenylalanine-, Leucine-, Serine-, and Tyrosine-initiated codons are preferred in all genes. In the delta, Valine-initiated codons were also preferred, while in other VoCs and Sarbecoviruses, Serine initiated (06) codon pairs were preferred. Phenylalanine and Leucine-initiated codon pairs were preferred for the M gene in all envisaged strains. In the N gene and S genes, Glycine-initiated codon pairs were preferred in all VoCs, including delta strain apart from Sarbecoviruses. Since codon preference is similar for all viruses, similarity in codon pair preferences is also expected, and as expected, many of the codon pair usages are the same for delta compared to other strains; however, some unique patterns were also present, which could be molecular signatures. Authors suggest investigators use the information where highly preferred codon pairs are initiated with specific codons for recoding viruses through excessive codon optimization or deoptimization.
Since the genetic code is redundant, 18 out of 20 amino acids are encoded by two, three, four, or six synonymous codons. The observed usage of these codons is different from what is expected and called codon bias. This bias can be species-specific and correlated with the tRNA pool. Together the tRNA pool and codon usage determine how efficiently a protein will be translated. Since the virus depends on host cell cellular machinery for protein translation, many viral genomes contain more host-preferred codons. In an elegant work of Chen et al.,(2020), it was demonstrated that if host codon usage is similar to that of viral codon usage, it reduces the burden on host translation machinery while increasing viral gene expression. Human genes, which have a similar codon usage pattern to viral genes, are upregulated during infection between the host and virus is very similar for symptomatic hosts than natural hosts [115], suggesting more severe outcomes of having high codon usage similarities between the host and the pathogen. SARS-CoV-2 delta strain virus also preferred codons; for them, fewer isoacceptor tRNAs were present, and it appears to be a strategy where, in the key structural motifs, the pace of translation is kept low to facilitate proper folding of viral protein. Our result corresponds to the results found in the Nipah Virus by Khandia et al. [74], where a suboptimal tRNA pool was used for encoding viral genes. Similarly, in the hepatitis A virus (HAV), which presents a highly biased codon usage as opposed to the host codon usage and usage of inefficient IRES, the virus is able to synthesize its proteins owing to the usage of less abundant tRNA pool of host that results in a poor replication rate, and thus it is difficult to culture virus in cell culture [116]. The attempt to make the tRNA pool more available to the HAV virus, in fact, resulted in a loss of fitness and which later recovered through a re-deoptimization [117]. Based on our study, we concluded that while constructing attenuated vaccine candidates through synthetic biology approach using structural genes, disruption of favored codon pairs is a better strategy compared to incorporating "one to stop" TA dinucleotide or incorporating rare codons. Further, through reverse genetics, the desired deoptimized recombinant recoded virus may be rescued from cell culture and used to investigate efficacy and protection in the future.

Conclusions
Virus recoding, taking advantage of the synthetic biology approach, is an emerging technique in constructing vaccine candidates. Changing the overall nucleotide composition, CpG and TpA content, codon or codon pair deoptimization or excessive optimization, and knowledge related to the host tRNA pool are a few strategies currently being adapted for attenuating pathogens for vaccine candidate development. In the present study, we envisaged various molecular features of four structural genes of the SARS-CoV-2 virus delta strain. The study's outcome encompasses information relating to the overall composition, where we found the genome rich in A/T nucleotides, specifically in T nucleotides. The information related to dinucleotide proportion may be used to carefully recode the virus so that its CpG content remains in a way that it will not escape the immune response.
On the other hand, it should not be too high to be removed instantly by the immune system. Codons CGG (Arg), CCG (Pro) and CAC (His) are rare in all the envisaged genes, while most of them preferred ACT-, AAT-, TTT-, and TTG-initiated codons. We also envisaged a positive codon context in S gene. The information related to the human isoacceptor tRNA pool and preferred codons in delta virus also might be helpful while considering the codons while recoding. Overall the information generated in the present study will be beneficial for researchers who are considering synthetic biology approach to develop a vaccine candidate again the deadly SARS-CoV-2 strain, and they may choose to have various options in combination to achieve a safe and efficacious vaccine candidate. Instead of incorporating rare codons, disruption of favored codon pairs is a more viable strategy in obtaining better vaccine candidates owing to both reduced protein expression and lower transcript stability.